Here we look at using InferCNV
to identify tumor and normal cells in SCPCS000490.
Before running this notebook, InferCNV is run using the
run-infercnv.R script. This script runs
InferCNV once with Endothelial cells as the reference cells
and once without any cells as the reference.
Here, we look the copy number variations found reported by
InferCNV and use that copy number information to call cells
as tumor or normal cells. We then compare these classifications to
manual classifications and annotations from SingleR and
CellAssign.
suppressPackageStartupMessages({
# load required packages
library(SingleCellExperiment)
library(ggplot2)
library(dplyr)
})
# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)
# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")
sample_dir <- file.path(data_dir, "SCPCP000015", params$sample_id)
# The path to this module
module_base <- file.path(repository_base, "analyses", "ewings-cell-type")
# source in helper functions
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
# Input files
sce_filename <- glue::glue("{params$library_id}_processed.rds")
sce_file <- file.path(sample_dir, sce_filename)
# tumor normal classifications
classifications_results_dir <- file.path(module_base, "results", "marker_gene_analysis")
classifications_filename <- glue::glue("{params$library_id}_tumor_normal_classifications.tsv")
classifications_file <- file.path(classifications_results_dir, classifications_filename)
# output from running infercnv
results_dir <- file.path(module_base, "results", "infercnv", params$library_id)
all_results_dir_list <- c(
no_ref = file.path(results_dir, "no_reference"),
with_ref = file.path(results_dir, "endothelial_reference")
)
# read in sce object
sce <- readr::read_rds(sce_file)
# read in tumor normal classifications
manual_classifications_df <- readr::read_tsv(classifications_file)
## Rows: 4290 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): barcodes, marker_gene_classification, cluster_classification
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Below is the heatmap produced by InferCNV showing the
CNV’s for each cell using endothelial cells as a reference. Each row is
a cell and each column is a gene. The top group contains all cells that
were used as the reference group and the bottom is all other cells. The
color indicates the relative expression intensity of each gene with the
darker colors either indicating a gain of expression or loss of
expression compared to the reference cells.
From this heatmap we see that there is a distinct group of cells
within the bottom portion of the plot that appear to have a lot more
CNVs than the rest of the cells. It’s probable that these correspond to
the tumor cells, but InferCNV does not provide any
classification of cells as tumor or normal, it just provides information
on copy number variations.
Below is the heatmap produced by InferCNV showing the
CNV’s for each cell without using a reference.
From the InferCNV
documentation:
Note, if you do not have reference cells, you can set ref_group_names=NULL, in which case the average signal across all cells will be used to define the baseline. This can work well when there are sufficient differences among the cells included (ie. they do not all show a chromosomal deletion at the same place).
Given this information, I wouldn’t expect the CNV inference to work well because we see that all cells that do have prominent CNVs have the same CNVs. There is not a lot of variation.
Looking at this we see a small group of cells that may have some
variations, but they are different from the variations seen above. I
don’t know that using this without a reference is the best call since
it’s unlikely that it will identify distinct groups of cells from tumors
with quiet genomes with no baseline. The rest of the analysis will use
the results from running InferCNV with endothelial cells as
the reference.
InferCNV uses an HMM-based model to predict whether or
not a copy number variation (either loss or gain) is present for each
chromosome in each cell. For each predicted CNV region, a posterior
probability is calculated for that CNV for each cell.
From the InferCNV
documentation:
CNV regions with mean posterior probabilities of being normal (no CNV) that are above a maximum threshold are removed as likely false positive predictions.
The predictions information lies in a few files that live in the
output directory. We want to know which cells had CNVs and how many CNVs
were found. To map the information about CNVs to individual cells, they
recommend using the infercnv::add_to_seurat() function
which takes the output directory from running InferCNV and
reads in all necessary files and adds the per cell CNV information to
the meta.data. So we will start by converting our processed
SCE object to a Seurat object, incorporate the CNV information, and then
add the CNV information to the SCE object.
infer_cnv_dir <- all_results_dir_list["with_ref"]
# first create a Seurat object from our SCE object
seurat_obj <- scpcaTools::sce_to_seurat(sce)
# add in the CNV information
# this adds new columns to the meta.data indicating presence of CNVs for each chromosome
seurat_obj <- infercnv::add_to_seurat(seurat_obj = seurat_obj,
infercnv_output_path = infer_cnv_dir)
# grab metadata from Seurat and add back to SCE
seurat_metadata <- seurat_obj@meta.data |>
dplyr::select(-c(orig.ident, nCount_RNA, nFeature_RNA))
colData(sce) <- DataFrame(seurat_metadata, row.names = seurat_metadata$barcodes)
The metadata that we added includes the following columns for each chromosome:
has_cnv_chr1has_loss_chr1has_dupli_chr1proportion_cnv_chr1proportion_loss_chr1proportion_dupli_chr1proportion_scaled_cnv_chr1proportion_scaled_loss_chr1proportion_scaled_dupli_chr1The has columns are boolean columns indicating if that
cell has presence of a CNV, loss, or duplication event. If
has_cnv is TRUE if either has_loss or
has_dupli is TRUE.
See the comments in this issue explaining what the contents of the columns are.
One approach to classifying tumor cells is to look for cells that have a lot of CNVs. Let’s start by just summing up the total CNVs/cell and plotting that.
# make df for plotting
coldata_df <- sce |>
scuttle::makePerCellDF(use.dimred = "UMAP") |>
# replace UMAP.1 with UMAP1
dplyr::rename_with(
\(x) stringr::str_replace(x, "^UMAP\\.", "UMAP")
)
# remove MT and GL chr, comments in docs recommend that these are not useful and can be inaccurate
chr_columns_to_keep <- which(!stringr::str_detect(names(coldata_df), "chrMT|chrGL000219.1"))
cnv_df <- coldata_df[, chr_columns_to_keep]
# grab any columns that have has_cnv
has_cnv_cols <- names(cnv_df)[startsWith(colnames(cnv_df), "has_cnv")]
cnv_df <- cnv_df |>
# get total number of cnvs per cell and then scale that number
dplyr::mutate(has_cnv = rowSums(across(has_cnv_cols)),
scaled_cnv = scale(has_cnv))
# plot the total number of CNVs
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = has_cnv)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw() +
scale_color_viridis_c()
This shows that the total CNVs are highest in the group of cells that
correspond to the same cells that are manually annotated as tumor cells,
which is promising. If we were to use total CNVs we would need some sort
of cut off for the number. Let’s look at the distribution of total CNVs
and also the scaled total CNVs to see if we can identify a cut off.
ggplot(cnv_df, aes(x = has_cnv)) +
geom_density() +
theme_bw() +
labs(x = "Total number of CNVs")
ggplot(cnv_df, aes(x= scaled_cnv)) +
geom_density() +
theme_bw() +
labs(x = "Scaled number of CNVS")
Looking at the scaled number of CNVs it does look like we could use 0 as the cutoff and call any cell that has > 0 CNVs (after scaling) as tumor and all other cells as normal.
# add cnv classification
cnv_df <- cnv_df |>
dplyr::mutate(cnv_classification = dplyr::if_else(scaled_cnv > 0, "Tumor", "Normal"))
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = cnv_classification)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw()
Another option is to use the scaled proportion of CNVs which tells us the proportion of genes on that chromosome that are part of the CNV.
proportion_cols <- names(cnv_df)[startsWith(colnames(cnv_df), "proportion_scaled_cnv")]
# get mean proportion of cnvs per cell and then scale that number
cnv_df <- cnv_df |>
dplyr::mutate(mean_proportion = rowMeans(across(proportion_cols)),
scaled_mean_proportion = scale(mean_proportion))
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = mean_proportion)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw() +
scale_color_viridis_c()
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = scaled_mean_proportion)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw() +
scale_color_viridis_c()
ggplot(cnv_df, aes(x = scaled_mean_proportion)) +
geom_density() +
theme_bw()
It looks like maybe we could again use 0 as the cutoff here and anything
above 0 is a tumor cell. Let’s see how this matches up with the
tumor/normal classifications we defined just using total CNVs.
ggplot(cnv_df, aes(x = scaled_mean_proportion, color = cnv_classification)) +
geom_density() +
theme_bw()
Interestingly, the cells we labeled as tumor have a higher mean proportion value then those that we labeled as normal. I’m not really sure which is “better” to use the mean proportion or just the detection of CNVs to call tumor cells.
For the rest of the notebook, we will use the classifications of tumor cells based on the sum of the total CNV occurrences.
Below we can compare these annotations from InferCNV to classifying tumor cells manually with marker genes.
all_classifications_df <- cnv_df |>
dplyr::left_join(manual_classifications_df) |>
# relevel classifications for confusion matrix
dplyr::mutate(
cnv_classification = forcats::fct_relevel(cnv_classification, "Tumor"),
marker_gene_classification = forcats::fct_relevel(marker_gene_classification, "Tumor")
)
## Joining with `by = join_by(barcodes)`
caret::confusionMatrix(
table(
all_classifications_df$cnv_classification,
all_classifications_df$marker_gene_classification
)
)
## Confusion Matrix and Statistics
##
##
## Tumor Normal
## Tumor 2293 465
## Normal 29 1503
##
## Accuracy : 0.8848
## 95% CI : (0.8749, 0.8943)
## No Information Rate : 0.5413
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7641
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9875
## Specificity : 0.7637
## Pos Pred Value : 0.8314
## Neg Pred Value : 0.9811
## Prevalence : 0.5413
## Detection Rate : 0.5345
## Detection Prevalence : 0.6429
## Balanced Accuracy : 0.8756
##
## 'Positive' Class : Tumor
##
This looks pretty consistent with what we see when we annotate cells manually.
We can also compare to the existing annotations obtained using
SingleR and CellAssign as part of
scpca-nf.
celltype_columns <- c(
"singler_celltype_annotation",
"cellassign_celltype_annotation"
)
# create jaccard matrices for SingleR and CellAssign compared to aneuploid/diploid
jaccard_matrices <- celltype_columns |>
purrr::map(\(name) {
make_jaccard_matrix(
cnv_df,
"cnv_classification",
name
)
}) |>
purrr::set_names("SingleR", "CellAssign")
# Set heatmap padding option
heatmap_padding <- 0.2
ComplexHeatmap::ht_opt(TITLE_PADDING = grid::unit(heatmap_padding, "in"))
# list of heatmaps looking at SingleR/ CellAssign vs tumor/normal
heatmap <- jaccard_matrices |>
purrr::imap(
\(celltype_mat, celltype_method) {
ComplexHeatmap::Heatmap(
t(celltype_mat), # transpose because matrix rows are in common & we want a vertical arrangement
col = circlize::colorRamp2(c(0, 1), colors = c("white", "darkslateblue")),
border = TRUE,
## Row parameters
cluster_rows = TRUE,
row_title = celltype_method,
row_title_gp = grid::gpar(fontsize = 12),
row_title_side = "left",
row_names_side = "left",
row_dend_side = "right",
row_names_gp = grid::gpar(fontsize = 10),
## Column parameters
cluster_columns = FALSE,
column_title = "",
column_title_gp = grid::gpar(fontsize = 12),
column_names_side = "bottom",
column_names_gp = grid::gpar(fontsize = 10),
column_names_rot = 90,
## Legend parameters
heatmap_legend_param = list(
title = "Jaccard index",
direction = "vertical",
legend_width = unit(1.5, "in")
),
show_heatmap_legend = celltype_method == "SingleR",
)
}) |>
# concatenate vertically into HeatmapList object
purrr::reduce(ComplexHeatmap::`%v%`) |>
ComplexHeatmap::draw(
heatmap_legend_side = "right",
# add a margin to the heatmap so labels don't get cut off
padding = unit(c(2, 20, 2, 2), "mm")
)
This looks as we expect with mostly muscle cells lining up with tumor cells.
There are a few known copy number variations in Ewing’s sarcoma:
Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations. See Tirode et al., and Crompton et al..
Below we will plot whether or not the gain or loss is present in each
cell according to InferCNV.
# umaps showing where cells are with gains/losses
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_dupli_chr8)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr8 gain") +
theme_bw()
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_dupli_chr12)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr12 gain") +
theme_bw()
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_dupli_chr1)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr1 gain") +
theme_bw()
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = has_loss_chr16)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr16 loss") +
theme_bw()
From this we see that the group of tumor cells at the bottom have chr8 and chr1 gains with some having gains of chr12. It doesn’t look like any cells have loss of chr16. The other interesting thing is that there is a group of cells that have gains of chr8, chr1, and chr12 that we did not classify as tumor cells. Should they be…
We can also look at the proportion of the chromosome with the gain.
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = proportion_scaled_dupli_chr8)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr8 gain") +
theme_bw() +
scale_color_viridis_c()
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = proportion_scaled_dupli_chr12)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr12 gain") +
theme_bw() +
scale_color_viridis_c()
ggplot(cnv_df, aes(x= UMAP1, y = UMAP2, color = proportion_scaled_dupli_chr1)) +
geom_point(alpha = 0.5, size = 0.5) +
labs(title = "Chr1 gain") +
theme_bw() +
scale_color_viridis_c()
Now when we look at the proportion, we see that only the group of tumor cells on the bottom seem to have a higher proportion of genes in chr8 that have a gain in expression. Based on the papers, we should see a gain in the whole chr8 and chr12, not just one arm or specific genes, so we expect a high value for the proportion. We see this for chr8 but not in chr12. However, when you look at the heatmap you see that most of Chr8 is included in the gain, but only a small part of Chr12 has a gain, so these plots are not surprising.
I think with using the combination of marker gene expression and CNVs, the bottom group of cells represent the tumor cells.
InferCNV can
be helpful in identifying cells that could be tumor cells, e.g., cells
that have a high number of CNVs.InferCNV without a group of normal cells does not
seem feasible for this particular sample since all tumor cells appear to
have the same set of CNVs with little variation.InferCNV can be helpful in looking at specific CNVs
that are expected in a tumor type, something that is harder to look at
in a tool like CopyKAT.# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_3.5.1 SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [5] Biobase_2.64.0 GenomicRanges_1.56.0 GenomeInfoDb_1.40.0 IRanges_2.38.0
## [9] S4Vectors_0.42.0 BiocGenerics_0.50.0 MatrixGenerics_1.16.0 matrixStats_1.3.0
##
## loaded via a namespace (and not attached):
## [1] spatstat.sparse_3.0-3 bitops_1.0-7 lubridate_1.9.3 httr_1.4.7 RColorBrewer_1.1-3
## [6] doParallel_1.0.17 tools_4.4.0 sctransform_0.4.1 utf8_1.2.4 R6_2.5.1
## [11] lazyeval_0.2.2 uwot_0.2.2 scpcaTools_0.3.2 GetoptLong_1.0.5 withr_3.0.0
## [16] sp_2.1-4 gridExtra_2.3 parallelDist_0.2.6 progressr_0.14.0 argparse_2.2.3
## [21] cli_3.6.2 formatR_1.14 spatstat.explore_3.2-7 fastDummies_1.7.3 sandwich_3.1-0
## [26] sass_0.4.9 labeling_0.4.3 Seurat_5.0.3 mvtnorm_1.2-4 spatstat.data_3.0-4
## [31] readr_2.1.5 proxy_0.4-27 ggridges_0.5.6 pbapply_1.7-2 parallelly_1.37.1
## [36] limma_3.60.0 rstudioapi_0.16.0 generics_0.1.3 shape_1.4.6.1 vroom_1.6.5
## [41] gtools_3.9.5 ica_1.0-3 spatstat.random_3.2-3 Matrix_1.7-0 futile.logger_1.4.3
## [46] fansi_1.0.6 abind_1.4-5 lifecycle_1.0.4 infercnv_1.20.0 multcomp_1.4-25
## [51] yaml_2.3.8 edgeR_4.2.0 gplots_3.1.3.1 recipes_1.0.10 SparseArray_1.4.3
## [56] Rtsne_0.17 grid_4.4.0 promises_1.3.0 crayon_1.5.2 miniUI_0.1.1.1
## [61] lattice_0.22-6 beachmat_2.20.0 cowplot_1.1.3 pillar_1.9.0 knitr_1.46
## [66] ComplexHeatmap_2.20.0 rjson_0.2.21 future.apply_1.11.2 codetools_0.2-20 leiden_0.4.3.1
## [71] glue_1.7.0 data.table_1.15.4 vctrs_0.6.5 png_0.1-8 spam_2.10-0
## [76] gtable_0.3.5 cachem_1.0.8 gower_1.0.1 xfun_0.43 S4Arrays_1.4.0
## [81] mime_0.12 prodlim_2023.08.28 libcoin_1.0-10 coda_0.19-4.1 survival_3.5-8
## [86] timeDate_4032.109 iterators_1.0.14 hardhat_1.3.1 lava_1.8.0 statmod_1.5.0
## [91] fitdistrplus_1.1-11 TH.data_1.1-2 ROCR_1.0-11 ipred_0.9-14 nlme_3.1-164
## [96] bit64_4.0.5 RcppAnnoy_0.0.22 rprojroot_2.0.4 bslib_0.7.0 irlba_2.3.5.1
## [101] KernSmooth_2.23-22 rpart_4.1.23 colorspace_2.1-0 nnet_7.3-19 tidyselect_1.2.1
## [106] bit_4.0.5 compiler_4.4.0 DelayedArray_0.30.1 plotly_4.10.4 scales_1.3.0
## [111] caTools_1.18.2 lmtest_0.9-40 stringr_1.5.1 digest_0.6.35 goftest_1.2-3
## [116] spatstat.utils_3.0-4 rmarkdown_2.26 XVector_0.44.0 htmltools_0.5.8.1 pkgconfig_2.0.3
## [121] sparseMatrixStats_1.16.0 highr_0.10 fastmap_1.1.1 rlang_1.1.3 GlobalOptions_0.1.2
## [126] htmlwidgets_1.6.4 UCSC.utils_1.0.0 shiny_1.8.1.1 DelayedMatrixStats_1.26.0 jquerylib_0.1.4
## [131] farver_2.1.1 zoo_1.8-12 jsonlite_1.8.8 BiocParallel_1.38.0 ModelMetrics_1.2.2.2
## [136] magrittr_2.0.3 modeltools_0.2-23 scuttle_1.14.0 GenomeInfoDbData_1.2.12 dotCall64_1.1-1
## [141] patchwork_1.2.0 munsell_0.5.1 Rcpp_1.0.12 ape_5.8 reticulate_1.36.1
## [146] stringi_1.8.4 pROC_1.18.5 zlibbioc_1.50.0 MASS_7.3-60.2 plyr_1.8.9
## [151] parallel_4.4.0 listenv_0.9.1 ggrepel_0.9.5 forcats_1.0.0 deldir_2.0-4
## [156] splines_4.4.0 tensor_1.5 hms_1.1.3 circlize_0.4.16 locfit_1.5-9.9
## [161] igraph_2.0.3 fastcluster_1.2.6 spatstat.geom_3.2-9 RcppHNSW_0.6.0 reshape2_1.4.4
## [166] futile.options_1.0.1 evaluate_0.23 SeuratObject_5.0.1 RcppParallel_5.1.7 lambda.r_1.2.4
## [171] renv_1.0.7 BiocManager_1.30.23 tzdb_0.4.0 phyclust_0.1-34 foreach_1.5.2
## [176] httpuv_1.6.15 RANN_2.6.1 tidyr_1.3.1 purrr_1.0.2 polyclip_1.10-6
## [181] clue_0.3-65 future_1.33.2 scattermore_1.2 coin_1.4-3 xtable_1.8-4
## [186] e1071_1.7-14 RSpectra_0.16-1 later_1.3.2 viridisLite_0.4.2 class_7.3-22
## [191] rjags_4-15 tibble_3.2.1 cluster_2.1.6 timechange_0.3.0 globals_0.16.3
## [196] caret_6.0-94